1. INTRODUCTION

Humans are social creatures, and interaction with one another is one of their fundamental needs.
This study identifies emotions in voice or speech. According to Akçay and Oğuz [1], conversation or speech
is one of the most natural ways for humans to express themselves. Hence, it has naturally become one of the
types of interaction analyzed using computer technology, and determining the emotions contained in speech is
a significant problem. Sarma et al. [2] noted that humans express their emotions in speech in implicit or
indirect ways, such as intonation or tone of voice. Speech emotion recognition (SER) explores the human
emotional state by using a computer to analyze a speech signal [3]. SER has many applications, such as in a
company call center to detect user satisfaction and in an emergency call center to detect the caller's
emotional condition so that responders can provide the correct response and aid [4].

SER is a supervised method that determines the class of emotion expressed in speech. According to
Akçay and Oğuz [1], there are three main types of SER classifiers: classical classifiers [5]-[10],
classifiers based on deep learning [11]-[14], and deep learning based enhancement techniques [15]-[19]. This
research explores the newest type, deep learning based enhancement techniques, because papers on this
technique are still being published [20].
SER needs a data source to predict emotions. The three categories of data sources are simulated, elicited, and
spontaneous [21]. Firstly, a simulated dataset is recorded by actors who are obliged to follow scripts.
Secondly, an elicited dataset is recorded by actors who improvise from given scenarios.
Thirdly, a spontaneous dataset is collected from real-life situations. A dataset is
created from one or more types of data sources. This study used one of the most common English speech
datasets, the interactive emotional dyadic motion capture (IEMOCAP) dataset [21]. IEMOCAP comprises
two data source types, simulated and elicited [22]. According to Lieskovská et al. [21],
IEMOCAP has 10 subjects (5 female and 5 male) with 10,039 speech utterances and 4 modalities:
audio, video, text, and motion capture of the face (MOCAP).

In building the SER model, a few aspects must be considered, such as the classifier method, the
modalities, and the speech data. According to Khalil et al. [3], a deep learning SER classifier fuses feature
extraction and feature classification into one phase, which is more efficient than classical classifiers
such as artificial neural networks (ANN). Sarma et al. [2] showed that combining a deep learning classifier
with the attention mechanism could improve the model's accuracy. Adding attention to the time delay neural
network (TDNN) long short term memory (LSTM) model improved the weighted accuracy (WA) from 59.5% for the
TDNN-LSTM model to 66.3%. In further support of this result, [23] used a transformer model based on an
attention mechanism to achieve a WA of 68.1%.
Kumar et al. [24] achieved better accuracy by using bidirectional encoder representations from transformers
(BERT) as a pre-trained transformer model, with a WA of 71.7%. An SER model is unimodal if the model
uses only one modal, such as audio. In contrast, an SER model is multimodal if the model uses two or
more modals, such as audio and text. Based on N and Patil [25], using a multimodal SER could improve the
unweighted accuracy (UA) on the IEMOCAP dataset from 65.9% for the text-only modal (self-attention-LSTM)
to 72.82% using cross-modal attention for multimodal SER with the text and audio modals. The input
speech data also determines the SER model's performance. Chatziagapi et al. [26] introduced generative
adversarial networks (GAN) as a data augmentation technique for SER. Firstly, Chatziagapi et al. [26]
created an imbalanced dataset scenario on the IEMOCAP dataset by deleting 80% of the samples from the angry,
sad, and happy emotion classes and retaining the neutral emotion class samples. Then, GAN was used to
generate new speech samples to balance the dataset. A balanced dataset means that every emotion class has
the same number of speech samples. This technique improved the unweighted average recall (UAR) of the SER
model from 52.3% to 54.6% and the F-score from 52.7% to 55% [26].

Based on the previous studies, two problems are explored. Firstly, in the SER field, datasets are often
imbalanced, with one or more underrepresented emotion classes. Those emotion
classes have fewer utterances than the other classes, which results in worse performance for the SER model
[26]. Secondly, there is still room to improve SER model performance when using deep
learning on the IEMOCAP dataset [24]. Therefore, this research uses IEMOCAP as the speech
dataset and one of the deep learning based enhancement techniques, the attention mechanism,
inside the multimodal SER model. The proposed work improves the SER model accuracy with three
contributions: i) this research uses multimodal SER consisting of the audio and text modals instead of
unimodal SER [26], ii) this research synthesizes new audio files using GAN from raw audio instead of
spectrograms [26], iii) this research replaces BERT for the text modal inside the SER model [24] with A Lite
BERT (ALBERT) and uses GAN as a data augmentation technique.


2. METHOD 
2.1. Overview 

This work follows the SER research method of [24], which consists of the four main steps presented in
Figure 1. Two types of inputs are used in this experiment: first, the speech or audio files for the audio
modal, and second, the transcript files for the text modal. The final output of the method is the predicted
emotion class, which is used to evaluate the model performance.

Figure 1. Method overview


Data augmentation and enhancement for multimodal speech emotion ... (Jonathan Christian Setyono) 


3010 O ISSN: 2302-9285 


Based on Figure 1, the first step is the data input and preprocessing phase. The second step is
the feature extraction phase for the audio and text modals. The third step is the SER model training phase.
The fourth and last step is the evaluation phase of the proposed SER model. This work compares multiple data
input strategies, with and without GAN as a data augmentation technique, to find the
SER configuration that reaches better accuracy.


2.2. Data input and preprocessing 

In the data input and preprocessing phase, the IEMOCAP dataset is prepared for use in the next
stage. According to Busso et al. [22], the IEMOCAP dataset has ten emotion classes:
Neu=neutral, Hap=happiness, Sad=sadness, Ang=anger, Sur=surprise, Fea=fear, Dis=disgust,
Fru=frustration, Exc=excited, and Oth=other [22]. Following previous SER works that used the
IEMOCAP dataset [20], [24], [26], this research used four emotion classes:
neutral, happiness (combined with excitement), anger, and sadness. The four emotion
classes total 5,531 samples, consisting of 1,103 for anger, 1,636 for happiness, 1,708 for neutral, and 1,084
for sadness. Two flows are used to prepare the data: one without data augmentation and one with data
augmentation.
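
The reduction from ten labels to four classes can be sketched as a simple label mapping. This is a minimal illustration only: the function name, the lowercase label keys, and the toy utterance list are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter

# Map label abbreviations to the four target classes.
# "exc" (excited) is merged into happiness; every other label is dropped.
FOUR_CLASS_MAP = {
    "neu": "neutral",
    "hap": "happiness",
    "exc": "happiness",
    "ang": "anger",
    "sad": "sadness",
}

def to_four_classes(utterances):
    """Keep only utterances whose label maps to one of the four classes."""
    kept = []
    for utt_id, label in utterances:
        target = FOUR_CLASS_MAP.get(label.lower())
        if target is not None:
            kept.append((utt_id, target))
    return kept

# Toy (utterance id, original label) pairs, invented for illustration.
toy = [("u1", "Hap"), ("u2", "Exc"), ("u3", "Fru"), ("u4", "Ang"), ("u5", "Neu")]
filtered = to_four_classes(toy)
counts = Counter(label for _, label in filtered)
print(counts)
```

In this toy run the frustration utterance is discarded and the excited utterance is counted as happiness.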


2.2.1. No data augmentation 

Three configurations of data input from the IEMOCAP dataset are used with no data augmentation,
as shown in Table 1. These data configurations used no GAN data augmentation. Hence,
every configuration in Table 1 can be regarded as an original dataset configuration.


Table 1. Data input configurations from IEMOCAP with no data augmentation


Configurations       Total utterances    Samples of each emotion class
Two sessions [24]    2,108               366 Ang, 605 Hap, 746 Neu, 391 Sad
Full samples         5,531               1,103 Ang, 1,636 Hap, 1,708 Neu, 1,084 Sad
4,000 samples        4,000               1,000 Ang, 1,000 Hap, 1,000 Neu, 1,000 Sad


In Table 1, the first configuration, taken from [24], used two out of five sessions from the IEMOCAP
dataset. Meanwhile, the full samples configuration used all the samples from the IEMOCAP dataset, and the
third configuration used 1,000 utterances for every emotion class in the IEMOCAP dataset. The two sessions
[24] configuration served as the comparison baseline.
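
A balanced configuration with the same number of utterances per class could be drawn as sketched below. The text does not state how the utterances were selected, so uniform random sampling without replacement, the function name, and the placeholder utterance ids are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def balanced_subsample(samples, per_class, seed=0):
    """Draw `per_class` samples from every class without replacement."""
    by_class = defaultdict(list)
    for utt_id, label in samples:
        by_class[label].append(utt_id)
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_class):
        ids = by_class[label]
        if len(ids) < per_class:
            raise ValueError(f"class {label!r} has only {len(ids)} samples")
        subset.extend((utt_id, label) for utt_id in rng.sample(ids, per_class))
    return subset

# Placeholder ids using the class counts of the full samples configuration.
full_counts = {"anger": 1103, "happiness": 1636, "neutral": 1708, "sadness": 1084}
full = [(f"{label}_{i}", label) for label, n in full_counts.items() for i in range(n)]
subset = balanced_subsample(full, per_class=1000)
print(len(full), len(subset))
```

Every class in Table 1 has at least 1,000 utterances, so the draw succeeds for all four classes.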


2.2.2. Data augmentation using GAN 

This experiment used HiFi-GAN as the data augmentation technique [27]. HiFi-GAN achieved the
best mean opinion score (MOS) of 4.36 compared to other neural vocoders such as WaveNet and MelGAN
[27]. A higher MOS indicates that the generated audio is closer to real, human-quality audio.
HiFi-GAN also generates audio 167.9 times faster than real time [27]. The data augmentation steps
using HiFi-GAN in this research are shown in Figure 2.

Figure 2. HiFi-GAN data augmentation

Based on Figure 2, the first step of HiFi-GAN data augmentation is selecting the audio input from
the IEMOCAP dataset. Six configurations of data input from the IEMOCAP dataset are presented in Table 2.
After the data are input, the second step is speech inference. HiFi-GAN provides the UNIVERSAL_V1
pre-trained model, which is applied to the IEMOCAP dataset as transfer learning. In this work, the speech
inference used raw audio files with the .wav file extension to synthesize audio files.


Table 2. Data input configurations from IEMOCAP with data augmentation using GAN


Configurations                     GAN samples                               Original samples
One class GAN                      1,000 samples of one emotion class        1,000 samples of three emotion classes
Two class GAN                      1,000 samples of two emotion classes      1,000 samples of two emotion classes
Three class GAN                    1,000 samples of three emotion classes    1,000 samples of one emotion class
Four class GAN                     1,000 samples of four emotion classes     0 samples of four emotion classes
Two sessions [24]+two class GAN    300 samples of anger & sadness            2,108 samples
Full samples+two class GAN         500 samples of anger & sadness            5,531 samples


The one to four class GAN configurations in Table 2 are based on the 4,000 samples configuration from
Table 1. In these configurations, the GAN technique replaced one or more emotion classes with generated
samples. Meanwhile, the last two configurations are based on the two sessions [24] and full samples
configurations from Table 1. In these configurations, the GAN technique duplicated original data, and the
generated samples were added to the original samples.
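
The replacement scheme of the one to four class GAN configurations can be sketched as follows. The function name and the placeholder sample ids are hypothetical; the sketch only assumes that each class holds 1,000 original samples and 1,000 generated counterparts, and it picks happiness and sadness as the replaced pair.

```python
def build_gan_configuration(original, generated, gan_classes):
    """Replace every sample of the classes in `gan_classes` with generated ones.

    `original` and `generated` map a class name to a list of sample ids.
    """
    config = {}
    for label in original:
        source = generated if label in gan_classes else original
        config[label] = list(source[label])
    return config

classes = ["anger", "happiness", "neutral", "sadness"]
# Placeholder ids standing in for real and synthesized .wav files.
original = {c: [f"{c}_orig_{i}" for i in range(1000)] for c in classes}
generated = {c: [f"{c}_gan_{i}" for i in range(1000)] for c in classes}

two_class = build_gan_configuration(original, generated, {"happiness", "sadness"})
gan_total = sum(1 for ids in two_class.values() for s in ids if "_gan_" in s)
total = sum(len(ids) for ids in two_class.values())
print(gan_total, total)
```

Passing one, three, or all four class names as `gan_classes` yields the other replacement configurations of Table 2 in the same way.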


2.3. Feature extraction 

After the data input phase, feature extraction used the features selected by [24]. There are two
types of feature extraction for the multimodal SER model: audio feature extraction for the audio
modal and text feature extraction for the text modal. Based on Kumar et al. [24], the three audio
features used for the multimodal SER model are the 128-dimensional mel-spectrogram, 40-dimensional
mel-frequency cepstral coefficients (MFCC), and 12-dimensional chroma vectors. Peeters [28] indicated that
with the chroma vectors feature, a model could capture the regularity of a speech utterance that cannot be
captured by spectral features such as the mel-spectrogram and MFCC.

As the text source, [24] used transcription files from the IEMOCAP dataset. The files contain each
spoken word or utterance that is used as the input data. Stopwords are then removed from the
transcription files. The pre-trained BERT transformer model is used to extract the text features [24]. This
work also added ALBERT to compare the multimodal SER performance. According to Lan et al. [29],
ALBERT has three differences from BERT that make it a better model:
factorized embedding parameterization, cross-layer parameter sharing, and inter-sentence coherence loss.
Besides improving performance, ALBERT is also a smaller model, which leads to lower graphics processing unit
(GPU) or tensor processing unit (TPU) usage and faster training [29]. BERT or ALBERT produces the
text feature as a vector of encoded input tokens with two special tokens, classification (CLS) and separator
(SEP). This experiment used the BERT base and ALBERT base configurations [29].
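
To illustrate why factorized embedding parameterization shrinks the model, the sketch below counts embedding parameters with and without the factorization. A direct embedding needs V x H parameters, whereas the factorized form needs V x E + E x H. The sizes used here (vocabulary V = 30,000, embedding size E = 128, hidden size H = 768) are assumed values for a base-sized model and are not reported in this work.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters.

    Without `embedding_size`, tokens map straight to the hidden size (V * H).
    With it, tokens map to a small embedding that is then projected
    to the hidden size (V * E + E * H).
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

# Assumed sizes for a base-sized model (not taken from this paper).
V, H, E = 30_000, 768, 128
direct = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(direct, factorized)  # 23040000 3938304
```

Under these assumed sizes, the factorized embedding uses roughly one sixth of the parameters of the direct embedding.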


2.4. Multimodal SER model training 

Before model training, the data from the previous phase is divided into two parts: 80% of the total
data is used as training data and 20% as test data. Every configuration is trained for up to 100 epochs
with early stopping when the model accuracy does not improve. Apart from the two sessions [24] configuration,
which used a batch size of 64, the configurations used a batch size of 128, which was the best value found
by hyperparameter tuning. The training of the multimodal SER model consists of three parts [24], as
shown in Figure 3.
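
The split and the stopping rule can be sketched as below. This is an illustration under assumptions: the text gives neither the patience value nor which accuracy is monitored, so the `patience` parameter, the function names, and the toy accuracy history are hypothetical.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out `test_fraction` of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def epochs_run(accuracies, max_epochs=100, patience=5):
    """Return how many epochs run before early stopping ends training.

    Training stops once the accuracy has failed to improve on its best
    value for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(accuracies[:max_epochs], start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(accuracies), max_epochs)

train, test = train_test_split(range(4000))
print(len(train), len(test))  # 3200 800

# Toy accuracy history: no improvement after the third epoch.
history = [0.50, 0.60, 0.65, 0.65, 0.64, 0.65]
print(epochs_run(history, patience=3))  # 6
```

With a patience of three, the toy run stops at epoch 6 because epochs 4 to 6 never exceed the best accuracy reached at epoch 3.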

Based on Figure 3, as proposed in [24], the audio, or speech emotion recognition, phase accepts the
mel-spectrogram, MFCC, and chroma vectors from the audio feature extraction step. The inputs are
processed separately using a gated recurrent unit (GRU) and an attention mechanism to extract the parts that
contain the most significant emotional information. The resulting speech vectors are used in the multimodal
emotion recognition phase. The text emotion recognition phase accepts the encoded tokens generated
by BERT or ALBERT and produces encoded hidden vectors that are also used in the multimodal emotion
recognition phase. The last part, the multimodal emotion recognition phase, combines the
outputs from the audio and text modals by concatenating them into one final vector. The final vector
determines the predicted emotion class. The Adam optimizer and the sparse categorical cross-entropy loss
function are used to train the multimodal SER model [24].

Figure 3. Multimodal SER model
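
The attention step and the final concatenation can be illustrated with a small sketch. It uses generic dot-product attention pooling as a stand-in, since the exact attention formulation of [24] is not given here; the query vector, the toy hidden states, and the toy text vector are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Collapse a sequence of hidden states into one vector.

    Each time step is scored by its dot product with `query`; the
    softmax of the scores weights the sum over time steps.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights

# Toy recurrent outputs for one audio feature: 3 time steps, 2 dimensions.
audio_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]  # hypothetical learned query vector
audio_vec, weights = attention_pool(audio_states, query)

# Toy text vector standing in for the transformer output.
text_vec = [0.5, -0.5, 0.25]

# Fusion by concatenation into one final vector.
fused = audio_vec + text_vec
print(len(fused), weights)
```

The third time step receives the largest weight because it has the highest score against the query, which is how attention emphasizes the most informative frames.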


2.5. Evaluation 

This research utilized multi-class evaluation metrics because the proposed SER model used the
IEMOCAP dataset to classify four emotion classes. The evaluation metrics used are accuracy and F1-score.
The F1-score is calculated using the unweighted average. Meanwhile, accuracy is scored using both the
weighted and unweighted average. There are two steps in the evaluation phase. First, each configuration is
scored with the chosen evaluation metrics. Second, the evaluation scores are analyzed by comparing
every data input configuration.
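
The three scores can be computed as sketched below. The sketch assumes the convention common in SER work, in which weighted accuracy is the overall fraction of correct predictions and unweighted accuracy is the mean of the per-class recalls; the text does not spell out its formulas, so this reading, the function name, and the toy labels are assumptions.

```python
def evaluate(y_true, y_pred):
    """Return (weighted accuracy, unweighted accuracy, unweighted F1-score)."""
    classes = sorted(set(y_true))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    weighted_accuracy = correct / len(y_true)

    recalls, f1_scores = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    unweighted_accuracy = sum(recalls) / len(classes)
    unweighted_f1 = sum(f1_scores) / len(classes)
    return weighted_accuracy, unweighted_accuracy, unweighted_f1

# Toy imbalanced example: four neutral and two sadness utterances.
y_true = ["neu", "neu", "neu", "neu", "sad", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "sad", "neu"]
wa, ua, f1 = evaluate(y_true, y_pred)
print(wa, ua, f1)
```

In the toy example the weighted accuracy (5/6) exceeds the unweighted accuracy (0.75) because the majority class is classified perfectly while half of the minority class is missed, which is why both are reported for imbalanced data.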


3. RESULTS AND DISCUSSION 
3.1. HiFi-GAN data augmentation result 

As mentioned, HiFi-GAN is utilized in this work to either replace or duplicate the audio data from
the IEMOCAP dataset. Figure 4 compares the mel-spectrogram of the real audio data with that of the augmented
data from HiFi-GAN. The mel-spectrogram is taken from one of the samples of the anger emotion class in the
IEMOCAP dataset.

Figure 4(a) shows the original mel-spectrogram. Meanwhile, Figure 4(b) shows the mel-spectrogram
generated using HiFi-GAN data augmentation. Figure 4(b) shows that HiFi-GAN
[27] could produce an audio file that is similar to the real human-quality audio of Figure 4(a). Both
figures used the Ses01F_impro01_F012_anger speech file. Hence, data augmentation using
HiFi-GAN is an option for the SER field when the data source is imbalanced.

Figure 4. Comparison between (a) real audio and (b) HiFi-GAN generated audio


3.2. SER model training time

This work used both BERT and ALBERT in the text modal of the SER model. Based on
Lan et al. [29], using ALBERT instead of BERT could reduce the training time of the SER model. This
claim holds in this research. With ALBERT, the training time per epoch is shorter than with BERT, as
presented in Table 3.

According to Table 3, ALBERT has a training time of 12 seconds per epoch. Meanwhile, BERT has a
training time of 18 seconds per epoch. Hence, ALBERT is faster than BERT by 6 seconds per epoch.
This could save considerable time when training bigger SER models.


Table 3. Training time per epoch for ALBERT and BERT 


Text modal configuration Training time per epoch (seconds) 
ALBERT base 12 
BERT base 18 
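
As a quick check of the figures in Table 3, the sketch below computes the per-epoch saving and an upper bound on the saving over a full run. The 100-epoch figure is the maximum stated in Section 2.4; early stopping may end training sooner, so the total is only an upper bound, and the variable names are illustrative.

```python
# Training time per epoch in seconds, as listed in Table 3.
seconds_per_epoch = {"ALBERT base": 12, "BERT base": 18}
max_epochs = 100  # upper bound; early stopping may end training earlier

saved_per_epoch = seconds_per_epoch["BERT base"] - seconds_per_epoch["ALBERT base"]
saved_total = saved_per_epoch * max_epochs
speedup = seconds_per_epoch["BERT base"] / seconds_per_epoch["ALBERT base"]

print(saved_per_epoch, saved_total, speedup)  # 6 600 1.5
```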


3.3. SER model performance 

Based on the completed training, this research showed that using GAN as a data augmentation
technique could improve multimodal SER performance. The SER performance evaluation is presented in
Table 4.

Except for the two sessions [24] configuration, the results in Table 4 are the best
performance achieved by each configuration. The two class GAN configuration attained the best
overall performance with a score of 0.96 or 96% for every evaluation metric. This means that the
configuration achieved strong multi-class classification performance on the IEMOCAP dataset. This
configuration used GAN-generated samples for the happiness and sadness emotion classes and original
IEMOCAP data for anger and neutral. The obtained score significantly improves on the WA and UA
of two sessions [24]: WA improved by 0.25 or 25% and UA by 0.22 or 22%.
The three class GAN configuration came second, with a score of 0.94 or 94% for each evaluation metric.
This configuration used GAN for every emotion class except happiness. However, GAN-generated data
cannot replace the IEMOCAP data completely in this research, as shown by the seventh SER configuration
(four class GAN), which achieved a score of only 0.73 or 73%.


Table 4. Multimodal SER model performance comparison for accuracy


No  SER configuration                Transformer model  Use of GAN  GAN percentage (%)  Weighted accuracy  Unweighted accuracy  F1-score
1   Two sessions [24]                BERT               No          -                   0.71               0.74                 0.74
2   Full samples                     ALBERT             No          -                   0.69               0.72                 0.72
3   4,000 samples                    ALBERT             No          -                   0.73               0.72                 0.72
4   One class GAN                    ALBERT             Yes         100                 0.88               0.89                 0.89
5   Two class GAN                    ALBERT             Yes         100                 0.96               0.96                 0.96
6   Three class GAN                  ALBERT             Yes         100                 0.94               0.94                 0.94
7   Four class GAN                   ALBERT             Yes         100                 0.73               0.73                 0.73
8   Two sessions [24]+two class GAN  BERT               Yes         100                 0.72               0.71                 0.71
9   Full samples+two class GAN       ALBERT             Yes         50                  0.76               0.76                 0.76


The last two SER configurations created a balanced dataset by duplicating a certain percentage of
the selected emotion classes. The eighth SER configuration, two sessions [24]+two class GAN, could not
achieve better performance than the original two sessions [24] configuration; its score was lower
by 0.03 or 3% for UA, precision, recall, and F1-score. On the other hand, the full samples+two class GAN
configuration attained higher performance than the full samples configuration. The WA score
increased by 0.07 or 7%, and the other evaluation metrics' scores increased by 0.04 or 4%.
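
The differences between configurations can be recomputed directly from the Table 4 scores, as in the sketch below. Only the rows needed for the comparisons are included, and the dictionary layout and function name are illustrative.

```python
# (weighted accuracy, unweighted accuracy, F1-score) as listed in Table 4.
table4 = {
    "Two sessions [24]": (0.71, 0.74, 0.74),
    "Full samples": (0.69, 0.72, 0.72),
    "Two class GAN": (0.96, 0.96, 0.96),
    "Two sessions [24]+two class GAN": (0.72, 0.71, 0.71),
    "Full samples+two class GAN": (0.76, 0.76, 0.76),
}

def delta(config, baseline):
    """Per-metric difference between a configuration and its baseline."""
    return tuple(round(c - b, 2) for c, b in zip(table4[config], table4[baseline]))

best_vs_baseline = delta("Two class GAN", "Two sessions [24]")
sessions_gan = delta("Two sessions [24]+two class GAN", "Two sessions [24]")
full_gan = delta("Full samples+two class GAN", "Full samples")

print(best_vs_baseline)  # (0.25, 0.22, 0.22)
print(sessions_gan)      # (0.01, -0.03, -0.03)
print(full_gan)          # (0.07, 0.04, 0.04)
```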

Based on the results, the researchers suspect that the increase in performance from the GAN
technique happened because the GAN method creates a new audio file that differs slightly from the
original audio file in the IEMOCAP dataset, as can be seen in Figure 4. These differences arise because
the pre-trained HiFi-GAN model learned the features of speech before the inference phase on the
selected IEMOCAP data. Hence, the model captures general pattern similarity and generates samples of an
emotion class that differ slightly from the original IEMOCAP samples.


4. CONCLUSION

This study aimed to explore the use of GAN as a data augmentation technique to replace or
duplicate emotion class samples from the IEMOCAP dataset in order to tackle the imbalanced data problem.
From the results, this research concludes that the GAN method could enhance the performance of the
multimodal SER model. The best SER model configuration, two class GAN for the happiness and sadness
emotion classes with ALBERT, achieved a WA, UA, and F1-score of 0.96 or 96%. In addition, it is crucial to
have a balanced dataset when creating the SER model, and this balance can be attained by using GAN as a data
augmentation method. Further work can explore the use of GAN on other speech emotion datasets and the
use of other GAN configurations. Furthermore, the reason why GAN data augmentation increases the SER model
performance has not yet been fully explored.